Notes - L&C II emma, speech production + perception

Greg Detre

Wednesday, 11 October, 2000

 

Pinker - The Language Instinct

Invitation to cognitive science, vol 1, Language

Chapter 4 - pg 87, Werker, "Exploring developmental changes in cross-language speech perception"

Introduction

Maintenance-Loss model

Functional Reorganisation hypothesis

Summary

Chapter 7

Finding structure in time - Elman, 1990

L&C lectures

Uttman lecture on speech production

Phonology

Bryant's Developmental Prelims lectures

Lecture on Infant perception (innate or acquired?)

Extract from Prelims essay on language acquisition

Extract from Prelims essay on "Infant Perception", re: VOT

Gross

Chapter 9 - Language pg 345

Encarta - linguistics etc.

Gleitman

Chapter

Misc

Essay titles

Points

Questions

 

 

Pinker - The Language Instinct

Index on "speech perception" - 158-163, 169, 181-188, 263-265

 

 

Invitation to cognitive science, vol 1, Language

Chapter 4 - pg 87, Werker, "Exploring developmental changes in cross-language speech perception"

Infants are able to discriminate between the acoustic nuances employed in any of the world's human languages. These sound atoms, or phonemes, combine to form a language's repertoire of legitimate sounds (syllables); the rules governing which combinations are legitimate are its phonotactics.

Interestingly, though, infants cannot discriminate comparable acoustic contrasts outside the categories employed by any of the world's languages, i.e. an infant's ability to perceive speech sounds beyond those of any given adult native speaker cannot simply be ascribed to a universal acoustic sensitivity. Rather, the comparatively broad range of infants' speech perception is still restricted to the domain of speech sounds occurring as part of human language. This points to an interesting convergence between the speech sounds our bodies have evolved to produce, and the phonemes employed in any human language you care to mention. Within this universal set of phonemes that humans can produce, different languages delineate their own repertoire of phonemes from which legitimate syllables can be formed.

Adults, on the other hand, remain easily able to discriminate the phonemes of their native language but seem to lose the ability to discriminate phoneme contrasts employed by other languages but not their own.

 

Almost every child learns to speak almost effortlessly given sufficient opportunity, yet adults find it extremely difficult to learn a second language. Similarly, those who have to learn language in later life through brain injury or lack of human contact never attain the accentless fluency that every child manages, sometimes in more than one language. This led Lenneberg (1967) to postulate a critical window of language-learning, ending before the onset of puberty, after which our ability to acquire language is markedly diminished.

This prompted the hypothesis that there might be some critical window of experience necessary to maintain our ability to perceive phonemic contrasts. For instance, a child brought up in a wholly English-speaking environment would, by a certain age, lose the ability to distinguish between the different pronunciations of "t" (as used in Hindi). Experiments comparing infants (under 1 year old) with children of 12 and 8 showed that this critical window of universal acoustic perceptual ability is even narrower than for accentless language learning in general. Indeed, 4-year-old children perform worse than adults, and thus significantly worse than infants, in recognising non-native phonemic contrasts.

 

The syllables /ba/ and /pa/ are differentiated by the length of time between the release of the lips and the onset of vocal fold vibration. This is known as the voice onset time (VOT). For example, going by the figures in the Uttman lecture, English /ba/ has a VOT of roughly 0-20ms, while /pa/ has a VOT of roughly 40-120ms. Speech perception is partly the business of categorising continua, like VOT, so that different phonemes can be distinguished and words recognised.

Studies on the ability to discriminate the VOT of different phonemes have raised some interesting issues. As would be expected, native speakers of English are very good at distinguishing between /ba/ and /pa/, since they are two functionally separate phonemes in English. Functionally separate phonemes affect the meaning in a given language on the basis of their differing acoustic properties, e.g. "bat" and "pat".

On the other hand, if two sounds are not categorised as separate phonemes in a language, then it seems reasonable to predict that speakers of that language will find it difficult to distinguish those sounds, because they will be unused to hearing them in separate contexts.

Though individual speakers of a language may indeed pronounce every phoneme slightly differently, those differences do not affect the meaning of their speech in any way. Such phonetic variations, which are categorised together and carry no difference in meaning, are known as "allophones". An accent arises when an individual speaker's allophones differ enough from the norm to be noticeable to a listener. It is now apparent why developmental changes in our ability to detect acoustic differences play such a crucial role in learning to speak a language flawlessly. If we cannot detect these subtle differences, we will never attain the proficiency of a native speaker.

 


[Diagram: a VOT continuum, labelled "Release" and "VO", with four points marked from left to right: Fr. b, Eng. b, Fr. p, Eng. p]

, show that infants are able to

 

connection between Functional Reorganisation hypothesis and the vocabulary spurt.

 

Introduction

Maintenance-Loss model

Functional Reorganisation hypothesis

4 year olds do even worse

Summary

 

Chapter 7

 

Finding structure in time - Elman, 1990

There is often a need to represent time, the serial flow of events, in a PDP (parallel) system. One method is to explicitly represent time spatially, with the organisation of the inputs representing the chronological sequence. But Elman wants to implicitly represent time by its effects on processing.

He first tackles a temporal version of the XOR problem, supplying the NN with a continual sequence of bits, with its task being to predict the next bit. His NN has just 1 input and 1 output, and 2 hidden units which map directly onto 2 "context units" (which form a memory of the hidden state from the immediately preceding time step). As each bit is fed into the input, the hidden units process it, their activations are copied to the context units, and one bit is output. The stream might look like 1 0 1 0 0 0 0 1 1 1 1 0 1 0 1 . . . (i.e. 101, 000, 011, 110, 101, ...) when seen in triplets of 2 inputs followed by their XOR. His network was able to predict every third bit with a low error (apparently by applying the XOR rule all the time), stumbling over the effectively random input pairs, plus the initial bit. This provided a first indication of how a temporal solution might change the nature of a problem compared to a static solution, and how the predictive pattern and error signal which emerge can indicate something about the temporal structure of the problem.
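The setup described above can be sketched in a few lines of Python. This is only an illustration of the data stream and of the hidden-to-context copy, not Elman's actual code: the names `xor_stream` and `TinyElmanNet` are made up for these notes, the weights are random and untrained, bias terms are left out, and the 0.5 starting value for the context units is an assumption.

```python
import math
import random


def xor_stream(n_triplets, rng):
    """Concatenate triplets (a, b, a XOR b) into one flat bit stream."""
    bits = []
    for _ in range(n_triplets):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        bits.extend([a, b, a ^ b])
    return bits


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyElmanNet:
    """1 input, 2 hidden, 2 context and 1 output unit (untrained, no biases)."""

    def __init__(self, rng, n_hidden=2):
        def w():
            return rng.uniform(-1.0, 1.0)

        self.w_in = [w() for _ in range(n_hidden)]
        self.w_ctx = [[w() for _ in range(n_hidden)] for _ in range(n_hidden)]
        self.w_out = [w() for _ in range(n_hidden)]
        self.context = [0.5] * n_hidden  # assumed neutral starting state

    def step(self, bit):
        """Process one bit and return the (untrained) guess at the next bit."""
        hidden = [
            sigmoid(
                self.w_in[j] * bit
                + sum(self.w_ctx[j][k] * self.context[k]
                      for k in range(len(self.context)))
            )
            for j in range(len(self.w_in))
        ]
        # The defining move: hidden activations become the next step's context.
        self.context = list(hidden)
        return sigmoid(sum(wo * h for wo, h in zip(self.w_out, hidden)))


rng = random.Random(0)
stream = xor_stream(5, rng)
net = TinyElmanNet(rng)
# One prediction per bit except the last: bit t is used to guess bit t+1.
predictions = [net.step(bit) for bit in stream[:-1]]
```

Only every third bit of `stream` is determined by the two before it, which is why even a trained network of this kind could only ever reduce its error at those positions.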

He demonstrated this further with larger input sizes in a larger NN with a similar "memory" structure on a 1000-bit semi-random letter sequence.

His last experiment employed 29 different roughly categorised words, e.g. noun-human, noun-food, verb-transitive, verb-destroy and a syntax of legitimate arrangements of 3-word patterns, e.g. noun-human verb-eat noun-food. Each word was represented orthogonally by a 31-bit vector, e.g. 0000000000000000000000001000000000. The NN was fed 6 passes of a huge stream of all 27,000 legitimate combinations of these 3-word sentences all concatenated together. He was able to represent the strength of the relationships between the different words graphically. The NN had arborised the words into their categories, with the full tree branching down to every individual word token in its context.
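A minimal sketch of what "represented orthogonally" means here, assuming a one-hot code (a single 1 in an otherwise all-zero vector). The six-word lexicon and the helper names `one_hot` and `dot` are hypothetical, chosen only for illustration; the 31-bit width is the one given in the notes.

```python
def one_hot(index, width=31):
    """Return a vector of `width` bits with a single 1 at position `index`."""
    vec = [0] * width
    vec[index] = 1
    return vec


def dot(u, v):
    """Dot product; 0 for any two different one-hot vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical mini-lexicon standing in for the 29 words.
lexicon = ["man", "woman", "eat", "break", "cookie", "glass"]
codes = {word: one_hot(i) for i, word in enumerate(lexicon)}

# A sentence becomes a sequence of vectors fed to the network one at a time.
sentence = ["woman", "eat", "cookie"]
inputs = [codes[word] for word in sentence]

# Every pair of different words is equally dissimilar at the input.
overlap_same_category = dot(codes["man"], codes["woman"])
overlap_different_category = dot(codes["man"], codes["eat"])
```

Because both overlaps are zero, the input code itself says nothing about which words belong together, so any category structure the network ends up with has to come from the order in which words occur.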

He drew several conclusions about the possibilities of such PDP models vs traditional symbolic attempts and non-temporal connectionist models. His temporal PDP models demonstrated new ways of learning about the patterns within the data (e.g. by using the systematic fluctuations in the error signal), which might even be fed back into the NN. Token/type issues are resolved by this new method of representation, which incorporates both seamlessly. Moreover, the capacity of such NNs is potentially infinite in a fully analogue system, but certainly excitingly large and more structured and informative than imagined.

 

L&C lectures

Uttman lecture on speech production

Voicing "ba" and "pa"

both labial (articulated by the lips), but they differ in the lag time between the opening of the lips and the opening/closing of the vocal cords = voice onset time (VOT)

ba VOT = 0-20ms (some people even pre-voice) (= a voiced consonant)

pa VOT = 40-120ms (= a voiceless consonant)

Phonology

languages vary in their phonemic inventory

phoneme = the smallest distinguishable sound in a language

minimal pairs = words that vary by only one phoneme

each phoneme can be made up of a selection of features in phoneme space (e.g. both "pa" and "ba" are bilabial, stopped, but "pa" is aspirated and not voiced, and vice versa for "ba")

aspiration = a breathy "hhhh" sound, but is not considered a phonemic contrast in English

allophones = variations for the same phoneme, e.g. aspirated "pit" vs non-aspirated "spit", "top" vs "stop"

aspiration is predictable according to a phonological rule, in English (for any stopped unvoiced consonant)

phonotactics = the rules governing which sounds can be combined to form legitimate words in a language

debates about the fundamental unit of speech sounds (syllable vs phoneme etc.) - though many speech sounds require a consonant-vowel combination to produce
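Two of the definitions above (minimal pairs, and aspiration as a predictable allophone) can be made concrete with a small sketch. Everything here is a simplification for these notes: words are hand-written lists of phoneme symbols, the names `is_minimal_pair` and `surface_form` are hypothetical, and the aspiration rule is reduced to "a voiceless stop at the very start of a word is aspirated", which is an assumption that covers the pit/spit and top/stop examples but not the full rule of English.

```python
VOICELESS_STOPS = {"p", "t", "k"}


def is_minimal_pair(a, b):
    """True if two phoneme sequences of equal length differ in exactly one slot."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def surface_form(phonemes):
    """Apply the simplified rule: aspirate a word-initial voiceless stop.

    Aspiration is written by appending "h" to the symbol (e.g. "ph").
    """
    out = list(phonemes)
    if out and out[0] in VOICELESS_STOPS:
        out[0] = out[0] + "h"
    return out


bat = ["b", "a", "t"]
pat = ["p", "a", "t"]
pit = ["p", "i", "t"]
spit = ["s", "p", "i", "t"]

bat_pat_is_minimal = is_minimal_pair(bat, pat)  # differ only in b vs p
pit_surface = surface_form(pit)    # initial p gets aspirated
spit_surface = surface_form(spit)  # p after s is left unaspirated
```

The aspirated and plain "p" never distinguish two words here, since the rule alone decides which one appears; that is what makes them allophones of one phoneme and not separate phonemes.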

 

Bryant's Developmental Prelims lectures

Lecture on Infant perception (innate or acquired?)

perceptual categories; discontinuities along a continuum

many perceptual continua which we treat as discont: e.g. reds/yellows/blues

although they're merely different points along the same continuum of wavelength, subj = differentiated

mechanisms to break up continua into categs (e.g. in the visual categs of colour, shape, angle)

also in speech

 

VOT - voice onset time

consonant - release breath and make articulatory movements, all different for the different consonants - also vibrate the vocal cords

variations in the interval between release of breath & vocal cord movement

 

VOT = continuum

sounds involving the same articulatory movements vary along the VOT continuum

can distinguish between but not within perceptual categories, e.g. "b" vs "p" (30ms = cut-off point)

 

HAS - high amplitude sucking

looking at speech categories in infants

HAS babies given a dummy wired up so that sucking maintains sound

babies learn this very quickly, and they do suck more for new sounds

if they hear the same sound, they suck less

if they hear a new sound, they start sucking more

Eimas - possibility of "b" and "p" speech categories in 1 and 4 month infants in 3 group experiment

1.     20 40, 40 20 (between "b" and "p")

2.     0 20, 40 60 (within, indistinguishable to us)

3.     control � sound unchanged

cross-changes big increases in rate of sucking

within 1-month babies slight increase, 4-month decrease

no change decreases

newborn babies able to make discrims which English adults cannot make - they drop categories
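As a rough check on the logic of the three-group design, the sketch below works out what a strictly categorical listener would predict. It reads each pair of numbers in the list above as a change from one VOT (in ms) to another, which is an interpretation of the shorthand, not something the notes state. The control value of 20ms is hypothetical (the notes only say the sound is unchanged), as is assigning a VOT of exactly 30ms to "p". The function names are made up for illustration.

```python
BOUNDARY_MS = 30  # cut-off point given in the notes


def category(vot_ms):
    """Label a VOT as "b" or "p"; exactly 30ms is assigned to "p" by assumption."""
    return "b" if vot_ms < BOUNDARY_MS else "p"


def crosses_boundary(vot_before, vot_after):
    """True if the change moves the sound into a different category."""
    return category(vot_before) != category(vot_after)


groups = {
    "between": [(20, 40), (40, 20)],
    "within": [(0, 20), (40, 60)],
    "control": [(20, 20)],  # hypothetical value: sound unchanged
}

# A strictly categorical listener should only notice a change (and so suck
# more) when the category label changes.
predicted_increase = {
    name: any(crosses_boundary(a, b) for a, b in pairs)
    for name, pairs in groups.items()
}
```

On this reading the prediction is an increase for the between group only. That matches the 4-month pattern recorded above, while the slight increase for 1-month babies in the within group is the one result a purely categorical account does not predict.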

 

Extract from Prelims essay on language acquisition

By the age of about 6, children have learnt over 15,000 words, equivalent to an average of about one word per waking hour. Carey described this process as "fast-mapping", demonstrating that 4 and 5 year olds when presented with a new word (the colour "chromium") in the context of a sentence would be able to pick out the chromium tray from eight others a week later.
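The "one word per waking hour" figure can be sanity-checked with some rough arithmetic. The sketch below assumes 12 waking hours a day and tries two starting points, birth and 18 months; both assumptions are mine and not from the essay, and the function name is hypothetical.

```python
WORDS_BY_SIX = 15_000      # figure from the essay
WAKING_HOURS_PER_DAY = 12  # assumption


def words_per_waking_hour(total_words, start_age_years, end_age_years,
                          waking_hours_per_day=WAKING_HOURS_PER_DAY):
    """Average learning rate over the period, ignoring leap years."""
    days = (end_age_years - start_age_years) * 365
    return total_words / (days * waking_hours_per_day)


# Counting from birth: 15,000 / (2,190 days * 12 hours)
rate_from_birth = words_per_waking_hour(WORDS_BY_SIX, 0, 6)

# Counting from 18 months: 15,000 / (1,642.5 days * 12 hours)
rate_from_18_months = words_per_waking_hour(WORDS_BY_SIX, 1.5, 6)
```

Under these assumptions the rate comes out at roughly 0.6 to 0.8 words per waking hour, so "about one" is the right order of magnitude but slightly generous unless the child is awake for fewer hours or starts later.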

Extract from Prelims essay on "Infant Perception", re: VOT

 

Experimenters also took advantage of human subjects' propensity to divide continua such as colour, shape and angle into categories. One such perceptual category is voice onset time (VOT), the interval between release of breath and vocal cord movement, used by adults in recognising different speech sounds such as "b" and "p". Adult English speakers can distinguish between but not within a VOT perceptual category with a cut-off point of 30 milliseconds, less than which a "b" sound is heard, and greater than which a "p" sound is heard.

With the technique of High Amplitude Sucking (HAS), babies are given a dummy wired up so that sucking causes a sound to be heard; they soon learn this association and suck harder as a consequence. Eimas et al. used babies of one and four months, first habituating them to a sound of a certain VOT, then changing to a second sound and measuring whether the infants registered the difference by sucking harder again. The experimenters used three groups: one where the changes of sound were both up and down over the 30 millisecond cut-off point for adult English speakers' "b"-"p" perceptual category; one where the changes stayed either side of the 30 millisecond boundary, indistinguishable to an adult; and one control where the sound did not change at all.

Changes in sound across the boundary resulted in a large increase in the rate of sucking for both age-groups of baby, indicating that even infants as young as one month old have the ability to distinguish such small changes in the VOT. The results were supported by the control group, where the sucking rate decreased over time, especially for the younger babies. Interestingly, however, in the group where the VOT changed within one of the adult English speakers' perceptual categories, the one month and four month old babies behaved differently. The older babies' sucking rate diminished by exactly the same amount as the control group's, while the younger babies seemed able to distinguish even sound changes which did not cross the 30ms cut-off point; thus it seems that this more precise perceptual ability is lost over time as we become used to hearing only the sounds employed by speakers around us, i.e. we lose categories.

 

Gross

Chapter 9 - Language pg 345

 

 

 

 

Encarta - linguistics etc.

Structural and descriptive linguists view spoken language as having a hierarchical structure of three levels: sounds, sound combinations (such as words), and word combinations (sentences). At the phonemic level, sounds are analysed; at the morphemic level, the combination of sounds into meaningful units of speech (morphemes, that is, words or word-building units) is described; and at the syntactic level, the combination of words in sentences and clauses is the focus.

 

Gleitman

Chapter

 

Misc

Essay titles

1.     Evaluate the current status of the Motor Theory of Speech Perception. What are its primary drawbacks?

2.     Is the categorical perception of speech a specifically linguistic skill?

3.     How do you think the initial, universal speech categories of infancy might become modified as a particular language is learned?

What are the interesting issues in this area? Please brainstorm and write down anything you can think of.

 

Points

Werker gets things mixed up when she talks about evolution and language sounds. Have our voice production mechanisms evolved to better suit the languages we speak? Surely not. The sounds which the world's human languages employ are selected from those which our voice boxes can best produce. Some languages use a wider repertoire of vowel/consonant sounds than others, but they are all drawn from the set of phonemes the human body can produce, and perceive, hence infants' ease in learning to perceive "non-native" sounds. Being a native in a language is determined by speaking it as a first language, rather than being born of native speakers.

 

Questions

What does "pronunciation" mean? Is it being able to very accurately produce the phonemes used in that language, or is it to do with combinations of them?

Is a legitimate sound in a given language determined by the phonemes you use, or the combination of them (i.e. the syllables)?

Is Werker he/she?

Can infants learn perfectly any language they are immersed in, no matter what their race?

Confusion over whether or not infants/adults are better/worse at detecting acoustic differences within categories not employed by human speech.

In FSIT, how is the representational space of the hidden nodes delineated? How can he tell that one pair of lexical items/types are closer than another?

At what age do we start to lose the ability to distinguish phonemic contrast? According to my essay on Bryant's lecture on VOT, there is a decline between the age of 1 and 4 months, but Werker refers mainly to 6 month babies and older.